A professor carried out an experiment to determine the best calcium level to ensure that fish have low respiration rates. The fish were randomly assigned to three tanks with different levels of calcium.
| Variable | Description |
|---|---|
| Calcium | A factor denoting the calcium level of the tank, Low, Medium or High |
| GillRate | A number denoting the respiration rate of the fish (gill beats per minute, gbpm) |
Summary of GillRate (gbpm) by Calcium level:

| Calcium | n | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|---|
| High | 30 | 37.00 | 45.75 | 58.50 | 58.17 | 68.00 | 85.00 |
| Low | 30 | 44.00 | 55.50 | 65.00 | 68.50 | 84.75 | 98.00 |
| Medium | 30 | 33.00 | 46.00 | 59.50 | 58.67 | 68.75 | 83.00 |
Individual t-tests for \(\mu\)
Multiple two-sample t-tests \(\mu_1 - \mu_2\)
Let \(s\) be the sample standard deviation, as defined in T011
The sample variance of a numeric variable is simply the square of its sample standard deviation, \(s^2\)
The “variability” of a numeric variable, that is, its total sum of squares, is
\[ \begin{aligned} SSTotal &= (n-1) \times s^2 \\ &= \cdots \\ &= \sum_{i=1}^{n} (x_i - \bar{x})^2 \end{aligned} \]
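The identity above can be checked numerically. The following is a minimal sketch in R on a small made-up vector (the values of `x` are hypothetical and are not taken from the fish experiment); it computes the total sum of squares both from the sample variance and from the definition.

```r
# Toy data (hypothetical values, not from the respiration experiment)
x <- c(4, 7, 9, 10, 15)
n <- length(x)

# Total sum of squares via the sample variance: (n - 1) * s^2
ss_total_var <- (n - 1) * var(x)

# Total sum of squares via the definition: sum of squared deviations
ss_total_def <- sum((x - mean(x))^2)

# The two routes should agree up to floating-point error
all.equal(ss_total_var, ss_total_def)
```

Because `var()` already divides by \(n - 1\), multiplying by \(n - 1\) simply undoes that division.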
The basic idea of ANOVA is to split the “variability” of the numeric variable into two (or more) distinct pieces
One-way ANOVA splits the “variability” of the numeric variable into two distinct pieces
If we believe a means-only model is appropriate for the data
\[ SSTotal = SSG + SSR \]
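where \(SSG\) is the sum of squares for groups (the “variability” between the group means) and \(SSR\) is the sum of squares for residuals (the “variability” within the groups)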
However, \(SSG\) and \(SSR\) are not directly comparable. Why?
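To make the split \(SSTotal = SSG + SSR\) concrete, here is a small sketch in R that computes each piece by hand. The data frame `toy.df` and its columns `y` and `g` are hypothetical stand-ins, not the fish data.

```r
# Toy data: three groups of three observations (hypothetical values)
toy.df <- data.frame(
  y = c(1, 2, 3, 4, 5, 6, 8, 9, 10),
  g = rep(c("A", "B", "C"), each = 3)
)

grand_mean  <- mean(toy.df$y)
group_means <- ave(toy.df$y, toy.df$g)  # each observation's own group mean

ss_total <- sum((toy.df$y - grand_mean)^2)    # total "variability"
ss_g     <- sum((group_means - grand_mean)^2) # between groups
ss_r     <- sum((toy.df$y - group_means)^2)   # within groups

c(SSTotal = ss_total, SSG = ss_g, SSR = ss_r)

# The two pieces should add back up to the total
all.equal(ss_total, ss_g + ss_r)
```

Running `anova(lm(y ~ g, data = toy.df))` on the same toy data should report the same two sums of squares in its `Sum Sq` column.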
Let the mean square for groups, \(MSG\), be defined as
\[ MSG = \frac{SSG}{k- 1} \]
Let the mean square for residuals, \(MSR\), be defined as
\[ MSR = \frac{SSR}{n-k} \]
where \(k\) is the number of groups and \(n\) is the total number of observations
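As a worked example, the two mean squares for the fish data can be computed from the sums of squares reported in the ANOVA table (\(SSG = 2037.2\), \(SSR = 19064.3\)), taking \(k = 3\) calcium levels and \(n = 90\) fish (30 per tank):

\[ \begin{aligned} MSG &= \frac{2037.2}{3 - 1} \approx 1018.6 \\ MSR &= \frac{19064.3}{90 - 3} \approx 219.13 \end{aligned} \]

Both values agree, up to rounding, with the Mean Sq column of the ANOVA table.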
More on 4.
This means that, within each group, the data must be approximately symmetric about the group’s sample mean, \(\bar{x}_k\), with no outliers, and we also expect the shape of the distribution to be bell-like
Like inference for a single mean or two means, we can be lenient on 4. as the number of observations increases…?!?
Are the assumptions met? (Using only a plot)
Recall that the fish were randomly assigned to three tanks with different levels of calcium.
# Fit the means-only model to the data
lm(GillRate ~ Calcium, data = respiration.df) |>
# Decompose the total "variability" between and within groups
anova()
Analysis of Variance Table
Response: GillRate
Df Sum Sq Mean Sq F value Pr(>F)
Calcium 2 2037.2 1018.61 4.6484 0.01208 *
Residuals 87 19064.3 219.13
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] 21101.56
Counter-intuitively, this is a one-sided hypothesis test of whether the observed data provide evidence against all population/underlying/true group means being the same
\[ \begin{aligned} H_0\!: & ~ \mu_1 = \mu_2 = \cdots = \mu_k \\ H_1\!: & ~ \text{At least one} ~ \mu_i \neq \mu_j \end{aligned} \]
The null hypothesis statement means that
The alternative hypothesis statement means that
\[ f_0 = \frac{\frac{SSG}{k-1}}{\frac{SSR}{n-k}} = \frac{MSG}{MSR} \]
where:
When the test statistic is a ratio of two mean squares, \(f_0\), we use the F-distribution to calculate the p-value
The mathematical detail relevant for us in DATAX121 is that:
Let \(F\) follow the F-distribution with \(\nu_1 = k - 1\) and \(\nu_2 = n - k\) degrees of freedom
\(\quad p\text{-value} = \mathbb{P}(F > |f_0|)\)
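The test statistic and p-value can also be reproduced by hand. The sketch below, in R, plugs in the mean squares printed in the ANOVA table for the fish data and assumes \(k = 3\) groups and \(n = 90\) fish (30 per tank); the variable names are illustrative only.

```r
# Values read from the ANOVA table for the fish data
msg <- 1018.61   # mean square for groups
msr <- 219.13    # mean square for residuals
k   <- 3         # number of groups (Low, Medium, High)
n   <- 90        # total observations (3 tanks x 30 fish)

# Test statistic: ratio of the two mean squares
f0 <- msg / msr

# p-value: upper-tail probability of the F-distribution
p_value <- pf(f0, df1 = k - 1, df2 = n - k, lower.tail = FALSE)

c(f0 = f0, p_value = p_value)
```

The results should land close to the `F value` and `Pr(>F)` entries in the table; small differences would come from using the rounded mean squares.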
# Fit the means-only model to the data
lm(GillRate ~ Calcium, data = respiration.df) |>
# Decompose the total "variability" between and within groups
anova()
Analysis of Variance Table
Response: GillRate
Df Sum Sq Mean Sq F value Pr(>F)
Calcium 2 2037.2 1018.61 4.6484 0.01208 *
Residuals 87 19064.3 219.13
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
[1] 21101.56